Overall Map

Chapter 1: Basic Concepts

Concepts: state, action, reward, return, episode, policy

Grid-world example

Markov decision process (MDP)

One concept: state value

$v_\pi(s)=\mathbb{E}[G_t|S_t=s]$
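The return $G_t$ inside the expectation is the discounted sum of rewards collected from time $t$ onward. A minimal sketch of computing one sampled return, assuming a finite episode and a discount rate `gamma`; the function name `discounted_return` is hypothetical. Averaging such returns over many episodes that start from $s$ would approximate $v_\pi(s)$.

```python
def discounted_return(rewards, gamma):
    """Return r_1 + gamma * r_2 + gamma^2 * r_3 + ... for one finite episode."""
    g = 0.0
    # Accumulate backwards so each step is g <- r + gamma * g.
    for r in reversed(rewards):
        g = r + gamma * g
    return g


# Illustrative reward sequence (made up for this sketch).
example_return = discounted_return([1.0, 1.0, 1.0], gamma=0.5)
```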

One tool: Bellman equation

$v_\pi = r_\pi + \gamma P_\pi v_\pi$
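Because the Bellman equation is linear in $v_\pi$, it can be solved either in closed form, $v_\pi = (I - \gamma P_\pi)^{-1} r_\pi$, or by iterating $v_{k+1} = r_\pi + \gamma P_\pi v_k$. The sketch below compares the two on a two-state example; the numbers in `r_pi` and `P_pi` are made up for illustration.

```python
import numpy as np

gamma = 0.9
# Hypothetical policy-induced reward vector and transition matrix.
r_pi = np.array([1.0, 0.0])
P_pi = np.array([[0.5, 0.5],
                 [0.0, 1.0]])

# Closed-form solution: solve (I - gamma * P_pi) v = r_pi.
v_closed = np.linalg.solve(np.eye(2) - gamma * P_pi, r_pi)

# Iterative solution: repeatedly apply the right-hand side of the equation.
v_iter = np.zeros(2)
for _ in range(1000):
    v_iter = r_pi + gamma * P_pi @ v_iter
```

Both approaches should agree up to numerical precision; the iterative form avoids a matrix inverse and is the one that carries over to larger state spaces.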

A special Bellman equation

Two concepts: optimal policy $\pi^*$ & optimal state value

One tool: Bellman optimality equation

$v = \max_\pi (r_\pi+\gamma P_\pi v) = f(v)$
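Since the equation has the fixed-point form $v = f(v)$, one way to solve it is to apply $f$ repeatedly from an arbitrary starting point, which is the idea behind value iteration. A minimal sketch, assuming a known model given as arrays `R[s, a]` and `P[s, a, s']` (both hypothetical numbers) and using the fact that the maximum over policies reduces to a per-state maximum over actions:

```python
import numpy as np

gamma = 0.9
# Hypothetical 2-state, 2-action model.
# R[s, a]: immediate reward; P[s, a, s']: transition probability.
R = np.array([[0.0, 1.0],
              [2.0, 0.0]])
P = np.array([[[1.0, 0.0], [0.0, 1.0]],
              [[0.0, 1.0], [1.0, 0.0]]])


def f(v):
    """Right-hand side of the Bellman optimality equation."""
    q = R + gamma * P @ v      # q[s, a] = r(s, a) + gamma * sum_s' p(s'|s, a) v(s')
    return q.max(axis=1)       # greedy maximization over actions


v = np.zeros(2)
for _ in range(1000):
    v = f(v)                   # fixed-point iteration v_{k+1} = f(v_k)

# A greedy policy with respect to the converged values.
greedy_policy = (R + gamma * P @ v).argmax(axis=1)
```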

First algorithms for optimal policies

Three algorithms:

Need the environment model

Mean estimation with sampling data

$\mathbb{E}[X] \approx \bar{x} =\frac{1}{n}\sum_{i=1}^n x_i$

First model-free RL algorithms

Gap: from non-incremental to incremental

Mean estimation
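The sample mean can be computed incrementally, one sample at a time, with the update $w_{k+1} = w_k - \frac{1}{k}(w_k - x_k)$. The sketch below checks that this recursion reproduces the batch average; the Gaussian samples and the seed are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical i.i.d. samples of a random variable X.
samples = rng.normal(loc=5.0, scale=2.0, size=1000)

w = 0.0  # initial guess; it is overwritten by the first sample since 1/k = 1 at k = 1
for k, x in enumerate(samples, start=1):
    w = w - (1.0 / k) * (w - x)   # incremental update, no need to store past samples

batch_mean = samples.mean()       # non-incremental estimate for comparison
```

Replacing the step size $1/k$ with a more general $\alpha_k$ gives the stochastic-approximation form that the later incremental algorithms build on.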

Algorithm:

Gap: from tabular representation to function representation

Algorithms:

State value estimation with value function approximation (VFA):

$\min_w{J(w)} = \mathbb{E}[(v_\pi(S)-\hat{v}(S,w))^2]$
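A minimal sketch of minimizing a squared-error objective of this kind by stochastic gradient descent with a linear approximator $\hat{v}(s,w) = \phi(s)^T w$. It assumes, for illustration only, that the true values `v_true` are available; in practice $v_\pi$ is unknown and is replaced by a Monte Carlo return or a TD target. The feature map `phi` and all numbers are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states = 5
# Hypothetical true state values, chosen to be exactly linear in the state index.
v_true = np.array([1.0, 2.0, 3.0, 4.0, 5.0])


def phi(s):
    """Hypothetical feature vector: a bias term and the state index."""
    return np.array([1.0, float(s)])


w = np.zeros(2)
alpha = 0.01
for _ in range(20000):
    s = rng.integers(n_states)          # sample a state uniformly
    error = v_true[s] - phi(s) @ w      # v_pi(s) - v_hat(s, w)
    w = w + alpha * error * phi(s)      # stochastic gradient step on the squared error

v_hat = np.array([phi(s) @ w for s in range(n_states)])
```

Two parameters stand in for five table entries here, which is the point of moving from a tabular to a function representation.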

Sarsa/Q-learning with VFA

Deep Q-learning

Gap: from value-based to policy-based

Metrics to define optimal policies

$J(\theta)=\bar{v}_\pi$ or $\bar{r}_\pi$

Policy gradient:

$\nabla J(\theta)=\mathbb{E}[\nabla_\theta \ln\pi(A|S,\theta)\, q_\pi(S,A)]$

Gradient-ascent algorithm (REINFORCE)

$\theta_{t+1}=\theta_t + \alpha \nabla_\theta \ln\pi(a_t|s_t,\theta_t)\, q_t(s_t,a_t)$
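A minimal sketch of this update on a hypothetical single-state problem with two actions and a softmax policy, where each episode lasts one step so the Monte Carlo estimate $q_t$ is just the observed reward. The reward values, step size, and seed are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
rewards = np.array([0.0, 1.0])   # hypothetical reward of each action
theta = np.zeros(2)              # one policy parameter per action
alpha = 0.1


def softmax(z):
    z = z - z.max()              # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()


for _ in range(2000):
    pi = softmax(theta)
    a = rng.choice(2, p=pi)      # sample a_t from pi(.|theta_t)
    q = rewards[a]               # Monte Carlo estimate q_t(s_t, a_t)
    grad_log_pi = -pi            # for a softmax policy: one_hot(a) - pi
    grad_log_pi[a] += 1.0
    theta = theta + alpha * q * grad_log_pi   # gradient-ascent step

pi = softmax(theta)              # final policy after training
```

With these rewards the probability of the higher-reward action should move toward 1 as training proceeds.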

Gap: policy-based + value-based

Algorithms: